An Improved Model of Dotplotting for Text Segmentation

Authors

  • Na Ye
  • Jingbo Zhu
  • Huizhen Wang
  • Matthew Y. Ma
  • Bin Zhang
Abstract

The Dotplotting method has been widely used for text segmentation because of its strength in detecting lexical repetition in a global context. However, a theoretical analysis of its segmentation criterion function reveals several deficiencies. The original function cannot make full use of text structure features and is not well suited to the text segmentation task. We propose an improved model (the MMD model) that resolves these deficiencies. Comparative experiments on a synthetic corpus and a real corpus show that the MMD model reduces the error rate of the original Dotplotting method by more than 20 percent and outperforms other existing methods derived from Dotplotting.
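As a rough illustration of the general dotplot idea the abstract refers to (not the MMD criterion, which this page does not define), the sketch below treats every pair of identical words in different positions as a "dot" and picks the single boundary that minimizes dot density between the two resulting segments. The function names `cross_density` and `best_single_boundary`, the one-boundary restriction, and the toy input are all assumptions made for illustration.

```python
from collections import Counter


def cross_density(sentences, b):
    """Density of repeated-word 'dots' between sentences[:b] and sentences[b:].

    Hypothetical helper: each sentence is a non-empty list of word tokens.
    A dot is a pair (word occurrence on the left, same word on the right).
    """
    left = Counter(w for s in sentences[:b] for w in s)
    right = Counter(w for s in sentences[b:] for w in s)
    dots = sum(left[w] * right[w] for w in left if w in right)
    # Area of the off-diagonal rectangle, measured in token pairs.
    area = sum(left.values()) * sum(right.values())
    return dots / area


def best_single_boundary(sentences):
    """Return the sentence index b (1 <= b < n) with the lowest cross density."""
    candidates = range(1, len(sentences))
    return min(candidates, key=lambda b: cross_density(sentences, b))


# Toy input (assumed): two sentences about pets, then two about finance.
toy = [
    ["cat", "dog"],
    ["dog", "cat"],
    ["stock", "market"],
    ["market", "stock"],
]
print(best_single_boundary(toy))
```

On this toy input the split between the second and third sentences shares no words across the boundary, so it has the lowest density. A full segmenter would search over several boundaries at once; this sketch only shows the density criterion for one cut.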

Similar resources

An Improved Automatic EEG Signal Segmentation Method based on Generalized Likelihood Ratio

It is often necessary to label electroencephalogram (EEG) signals by segments of similar characteristics that are particularly meaningful to clinicians and for assessment by neurophysiologists. Within each segment, the signals are considered statistically stationary, usually with similar characteristics such as amplitude and/or frequency. In order to detect the segment boundaries of a signal, we ...

An Improved Flower Pollination Algorithm with AdaBoost Algorithm for Feature Selection in Text Documents Classification

In recent years, the production of text documents has grown exponentially, which makes their proper classification necessary for better access. One of the main problems in classifying text documents is working in a high-dimensional feature space. Feature Selection (FS) is one way to reduce the number of text attributes. So, working with a great bulk of the feature spa...

A Modified Character Segmentation Algorithm for Farsi Printed Text Using Upper Contour Labelling

In this paper, a modified segmentation algorithm for printed Farsi words is presented. This algorithm is based on a previous work by Azmi that uses the conditional labeling of the upper contour to find the segmentation points. The main objective is to improve the segmentation results for low quality prints. To achieve this, various modifications on local baseline detection, contour labeling an...


Journal:
  • Journal of Chinese Language and Computing

Volume 17

Publication date: 2007